Predictive Active Set Selection Methods for Gaussian Processes 



Ricardo Hcnao, Ole Winthcr 

DTU Informatics, Technical University of Denmark, Denmark 
Bioinformatics Centre, University of Copenhagen, Denmark 



Abstract 



o 

(N 
C 

CO 
(N 



We propose an active set selection framework for Gaussian process classification for cases when the datasct is large 
enough to render its inference prohibitive. Our scheme consists of a two step alternating procedure of active set update 
rules and hypcrparamctcr optimization based upon marginal likelihood maximization. The active set update rules rely 
'on the ability of the predictive distributions of a Gaussian process classifier to estimate the relative contribution of a 
datapoint when being either included or removed from the model. This means that we can use it to include points 
with potentially high impact to the classifier decision process while removing those that are less relevant. We introduce 
.two active set rules based on different criteria, the first one prefers a model with interpretable active set parameters 
whereas the second puts computational complexity first, thus a model with active set parameters that directly control 
its complexity. We also provide both theoretical and empirical support for our active set selection strategy being 
a good approximation of a full Gaussian process classifier. Our extensive experiments show that our approach can 
1 compete with state-of-the-art classification techniques with reasonable time complexity. Source code publicly available 
at http : / / cogsys . imm.dtu.dk/passgp. 
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1. Introduction 

Classification with Gaussian process (GP) priors has 
many attractive features, for instance it is non-parametric, 
exceptionally flexible through covariance function designs, 
provides fully probabilistic outputs and Bayesian model 
comparison as principled framework for automatic hyper- 
paramctcr clicitation and variable selection. However, 
such a set of features comes in with a great disadvan- 
tage since the computational cost of performing inference 
scales cubically with the size, N, of the training set. In ad- 
dition, the memory requirements scale quadratically also 
with N. This means that applicability of Gaussian pro- 
cess classifiers (GPCs) is sadly limited to problems with 
dataset sizes in the lower ten thousands. The poor scaling 
of specially non-linear classification methods has inspired 
a considerable amount of research effort focused on sparse 
approximations [1, 2, 3, 4, 5, 6, 7]. See particularly [1, 2] 
for a detailed overview of sparse approximations in GPCs. 
These methods attempt in general to decrease the com- 
putational cost of inference in one degree w.r.t. TV, i.e. 
0{NM 2 ), where M < N and M is the size of a working 
set consisting on a subset of the training data or a set of 
auxiliary unobserved variables. Both ways of defining the 
working set basically target the same objective of getting 
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as close as possible to the classifier that uses the informa- 
tion of the entire training set, however they approach it 
from different angles. Using a subset from the entire data 
pool amounts to keep those data points that better con- 
tribute to the classification task and discard the remaining 
ones through some suitable data selection/ranking proce- 
dure [8, 9, 10, 6, 7]. Alternatively, building an auxiliary 
set tries to directly reduce the difference in distribution 
between the classifier using N points and the one using 
only M, by estimating the location of an auxiliary set in 
the input space, usually called pseudo-input set [1, 4, 11]. 
The latter approach is evidently more principled, however 
the number of parameters to be learnt grows with the num- 
ber and size of the auxiliary set, making it unfeasible for 
datasets in the upper ten thousands and sensitive to over- 
fitting due to the number of free parameters in the model. 
From a fully Bayesian perspective, in [12] the authors pro- 
pose an efficient MCMC based inference that is made pos- 
sible by using a sparse and approximate basis function 
expansion over the training data set. The main computa- 
tional burden is therefore the same as other sparse kernel 
methods possibly with a larger pre-factor due to sampling. 

Having in mind that our main goal is to obtain the best 
classification performance with the least computational 
cost possible, we do not attempt to estimate auxiliary sets 
but rather to select a subset of the training data. The 
framework presented here, Predictive Active Set Selection 
(PASS-GP) uses the predictive distribution of a GPC in 
order to quantify the relative importance of each data- 
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point and then use it to iteratively update an active set. 
Recently, [6] proposed a similar criterion in the context of 
active learning with Gaussian processes. We use the term 
active set because it is ultimately the one used to estimate 
the predictive distribution that produces the classification 
rule and active set updating scheme. In a nutshell, our 
framework consists of alternating between active set up- 
dates and hyperparameter optimization based upon the 
marginal likelihood of the active set. We provide two ac- 
tive set update schemes that target different practical sce- 
narios. The first simply called PASS-GP builds the active 
set by including/removing points with small/large predic- 
tive probability until no more or too few data points are 
included in the active set. This means that the size of the 
active set is not known in advance so as the expected com- 
putational complexity. The second scheme is aware that 
in some applications is very important to keep the com- 
putational complexity and/or memory requirements on a 
budget, thus being able to specify the size of the active set 
beforehand is essential. In fixed PASS-GP (fPASS-GP) we 
keep the size of the active set constant by including and 
removing the same amount of data points in each update 
to achieve the desired behavior. 

The remainder of the paper presents in Section 2 a con- 
cise description of expectation propagation based inference 
for GPCs. Section 3 continues with our proposed frame- 
work for active set selection, followed by some theoretical 
insights based upon a 'representer theorem' for the pre- 
dictive mean of a GP classifier in Section 4. Marginal 
likelihood approximations to the full GP classifier are in- 
troduced in section 5. Finally, experimental results and 
discussion appear in Sections 6 and 7, respectively. 



2. Gaussian Processes for Classification 

Given a set of input random variables X = 
[xi, . . . , xtv] t , a Gaussian process is defined as a joint 
Gaussian distribution over function values at the input 
points f = [/i, . . . , /at] t with mean vector m (taken to be 
zero in the following) and covariance matrix K with ele- 
ments Kij = fc(xi,Xj) and hypcrparametcrs 9. For clas- 
sification, assuming independently observed binary ±1 la- 
bels y = [j/i, . . . , un] T and a probit (cumulative Gaussian) 



likelihood function t(y n \f n ) 
intractable posterior 



$(/ n y n ), we end up with an 



N 



o(f\X,y) = Z- 1 p(f\X)l[[ t(y n \f n ) , 



71=1 



where the normalizing constant Z = p(y|X) is the 
marginal likelihood. If we want to perform inference we 
must resort to approximations. Here we use Expectation 
Propagation (EP) because it is currently the most accu- 
rate deterministic approximation, see e.g. [2, 13]. In EP, 
the likelihood function is locally approximated by an un- 



normalizcd Gaussian distribution to obtain 

N 

q(f\X,y) = Z£p(f\X) J] z-^iyMn) 

n=l 

= Z^p(i\X)Af(i\ih,C) , 
= Af(f|m,c) , 



(1) 



where g(f|X,y) w p(f|X,y), the z n are the normaliza- 
tion coefficients, t(y n \f n ) and J\f(i |m, C) conform the site 
Gaussian approximations to t(y n \f n ). In order to obtain 
q(i\X, y), one starts from g(f|X,y) = p(t\X,y) and up- 
date the individual t n site approximations sequentially. 
For this purpose, we delete the site approximation t n from 
the current posterior leading to the so called cavity distri- 
bution 

g\ n (f|X,y Nn ) =p({\X)]Jz^t(y i \f i ) , 
from which we can obtain a cavity predictive distribution 



9\n(j/n|X,y\„ 



%n|/n)?\n( f |X,y\ n )df , 



y n m v 



y/1 + V\ r 



(2) 



where m\ n = v\ n {C m \m n - C n ^m n ) and v\ n = (C n £ - 

C~^) _1 . We then combine the cavity distribution with 
the exact likelihood t(y n \f n ), to obtain the so called tilted 
distribution g„(f|X,y) = z~ 1 t(y n \f n )q\ n (i\X,y\ n ). Since 
we need to choose the parameters of the site approxima- 
tions we must minimize some divergence measure. It is 
well known that when g(f|X, y) is Gaussian, minimizing 
KL(p(f)||q<(f)) is equivalent to moment matching between 
those two distributions including zero-th moments for the 
normalizing constants. The EP algorithm iterates by up- 
dating each site approximation in turn and makes several 
passes over the training data. 

With the Gaussian approximation to the posterior dis- 
tribution in equation (1), it is possible to calculate the 
predictive distribution of a new datapoint x* as 



,y,x*) = / t(y*\r)q(r\X,y,x*)df* 



= $ 



/ y*m* 



(3) 



where q(f*\X, y, x*) is the approximate predictive Gaus- 
sian distribution (the marginal of q(f, /*|X, y, x*) w.r.t. 
f) with mean m* = k* T (K + C) _1 m and variance v* = 
k** — k* T (K + C) _1 k*. In addition, the approximation to 
the marginal likelihood p(y|X) results in the normaliza- 
tion constant from equation (1), i.e. g(y|X) = Zep- The 
logarithm of Zep(0 ,X,y) and its derivatives can be used 
jointly with conjugate gradient updates to perform model 
selection under the evidence maximization framework. For 
a detailed presentation of GP including its implementation 
details, consult [2, 13]. 
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3. Predictive Active Set Selection 

The EP algorithm is performed by iterative updates of 
each site approximation using the whole dataset {X,y}. 
In the active set scenario on the other hand, we only want 
to approximate the posterior distribution in equation (1) 
using a small subset, the active set {X^y^}. Since ex- 
ploring all possible active sets is obviously intractable even 
for a fixed active set size M, the problem is how to select an 
active set that delivers a performance as good as possible 
within the available computing resources. The Informative 
Vector machine (IVM) [8] for instance, computes in each 
iteration the differential entropy score for all data points 
not already part of the active set {X/,y/} and perform 
updates by including the single point leading to a maxi- 
mum score. Despite this greedy heuristic, IVM has proved 
to behave quite well in practice, giving the so far best re- 
ported GP performance on the USPS and MNIST tasks 
[8, 9]. We propose an iterative approach in the same spirit 
with two main conceptual changes: 

• Active set inclusion/deletion based directly upon 
the data point weight in prediction. The 'representer 
theorem' for the mean prediction, discussed in Section 4, 
leads directly to the weight being expressed in terms of 
(a derivative of) the cavity predictive probability. This 
means that we can actually use the predictive distribu- 
tion for a point in the inactive set to predict the weight 
it would have if it would be included in the active set. 
For classification we use the (cavity) predictive proba- 
bility to decide upon deletion and inclusion because it 
is monotonically related to weight and it is a readily 
interpretable quantity. 

• Hyperparameter optimization must be an integral 
part of algorithm, because the weights of the examples 
(and thus the active set) is conditioned on the hyperpa- 
rameter values and vice versa. We therefore alternate 
between active set updates and hyperparameter opti- 
mization using several passes over the data set. 

Next we discuss the details of our (f)PASS-GP framework 
followed by a detailed comparison with the IVM. First we 
need to define rules for including and deleting points of 
the active set. As already mentioned, we use the predic- 
tive distribution in equation (3) for inclusions since data 
points with small predictive probability are more likely to 
contribute to improve the classifier performance and the 
quality of the active set. For deletions, we use the cavity 
predictive distribution in equation (2) because when ex- 
amined carefully it can be seen as a leave-one-out estima- 
tor [14]. This means that points with cavity probability 
close to one do not contribute to the decision rule thus 
they can be discarded from the active set. With the two 
ranking measures set, i.e. equations (2) and (3), we have 
essentially two possibilities. The first is to set probability 
thresholds on the distributions and let the model decide 
the size of the active set or we can rather specify directly 



the amount of inclusions/ deletions. In PASS-GP, we in- 
clude points in the active set with probability less than 
Pi nc and remove them with probability greater than pdei- 
The appealing aspect of these thresholds is that they can 
be interpreted, for instance if we set p mc = 0.5 we will 
include all misclassified observations in the current active 
set whereas if p lnc = 0.6 we will also include points near 
the decision boundary. We require two thresholds because 
we only want to remove points that as for the classifier 
are very easy to classify, so unlike p lnc , Pdc\ must be close 
enough to one. In fPASS-GP, we want to keep the compu- 
tational complexity of the classifier under control thereby 
we want the size of the active set to be fixed. For this 
purpose we only have to be sure that each active set up- 
date includes and removes the same amount of points. In 
practice we define p cxc as the exchange proportion w.r.t. 
M, meaning that each update replaces the fixed propor- 
tion of most hard to classify points in the inactive set with 
those more surely classified in the current active set. This 
update rule assumes that the active set is large enough 
to contain points in the active set with cavity probability 
close to one. 

From a practical point of view, ranking every point in 
the inactive set at each iteration for inclusion could be- 
come prohibitive for large datasets. However we still want 
to be able to cover the whole dataset rather than select- 
ing a random subset for ranking. We then split the data 
into iV su b non-overlapping subsets and process each one of 
them in each iteration, such that each batch has something 
between 100 and 1000 data points. 

Hyperparameter selection is a very important feature 
and needs to be done jointly with the active set update 
procedure. Algorithm 1 starts from a fixed randomly se- 
lected active set of size N- in i t (that is M in fPASS-GP), 
large enough to provide a good initial hyperparameter set 
values. Next we alternate between active set and hyper- 
parameter optimization updates. Having in mind that we 
only expect small changes of the hyperparameters from 
one iteration to another, we reuse current values of 6 as 
initial values for the next iteration to speed-up the learn- 
ing process. The addition and deletion rules in Algorithm 
1 have parameters {pinoPdei} and p cxc for PASS-GP and 
fPASS-GP, respectively. 

3.1. Differences between (f)PASS-GP and IVM 

Since IVM is the closest relative of our active set selec- 
tion method, we briefly discuss the main differences be- 
tween the two: (i) The active set and thus the compu- 
tational complexity is usually fixed beforehand in IVM. 
PASS-GP works with inclusion and deletion thresholds in- 
stead, (ii) IVM does not allow for deletions from the active 
set which is a clear disadvantage as points often become 
irrelevant at a later stage, when more points have been 
included. In (f)PASS-GP we can make an (almost) unbi- 
ased common ranking of all training points active as well 
as inactive, using a quantity that is meaningful and di- 
rectly related to the weight of the training point in predic- 
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Algorithm 1: Predictive active set selection 



{X,y}, 6 and {N- mit , N auh , 
p inc and p dcl (PASS-GP) 
p cxc (fPASS-GP) 
q(f A \X-A,yA), Onew and A 



Input 
Input 
Input 
Output 
begin 

A <- {!,..., N init } 



A pass } 



{X,y}^ b , 



{X,y} 



1 to N pass do 



for i 

for j ' = 1 to A su b do 

new = argmax e log Z E p(0, Xa, ya) 
Get q(t A \X A ,y A ) and g(j/*|X A , yA, x*) 
for all the {x n ,y n } e {X A ,y A } do 
if RemoveRule(<7\„(y n |XA,yA\n)) 
then A <- 
end 

forall the {x„,y„} e {X,y}^ b do 
if AdditionRule(q(?/*|XA, Ya, x*, v)) 
then A<-AU {n} 
end 
end 
end 



end 



tions. Using both inclusions/deletions and several passes 
over the training set makes (f)PASS-GP quite insensitive 
to the initial choice of active set. (iii) When the dataset 
is considerably large, IVM randomly selects a subset of 
points to be ranked from the inactive set, meaning that 
is likely that some points of the dataset are never consid- 
ered for inclusion in the active set. (iv) The hyperparame- 
ter optimization is a part of the algorithm in (f)PASS-GP 
working on subsets of data between updates and iterating 
over the data set several times. IVM makes a single inclu- 
sion per step and in principle stops when the limit for the 
active set is reached, (iv) In terms of complexity time per 
iteration IVM is faster than (f)PASS-GP, O(NM) against 
0(M 2 (2 + N/N Buh )) where M is the size of A, however 
storage requirements are considerable lower, 0(M 2 ) com- 
pared to O(NM). 



4. Representer for Mean Prediction 

The 'representer theorem' for the posterior mean of f 
[14], connects the predictive probability and the weight of 
a data point. Using that p(f|X) = — Kjyp(f|X), we get 



the exact relation for the posterior mean ( f ) 
the weight of element n being 



K« with 



1 



p(y|x 

(p'(y n \f, 



p(f|X)^-p(y|f)df 

Ofn 



\n 



[p{Vn\fn 



gh lo S {P(yn\fn + h))\ n 



where (■) \ n = m \n denotes an average over a pos- 
terior without the n-th data point and p'(y n \fn) = 
9p(y n \fn)/dfn- The final expression implies that the 
weight is nothing but the log derivative of the cavity pre- 
dictive probability (p(y n \f n ))\„ = p(y n |X, y\„). For 
regression, p(y n \f„) = N{y n \f n , <j 2 ) and a„ = (y„ - 
( fn > \n)(a 2 +v > )- 1 with v\ n = ( f 2 n ) v „ - ( /„ ) \ n . The 
element a n will therefore be small when the cavity mean 
has a small deviation from the target relative to the vari- 
ance. For a new data point pair {x*,y*}, we can calcu- 
late the weight of this point exactly, replacing the cav- 
ity average with the full average in the expression above. 
We can therefore predict without any EP rerunning, how 
much weight this new point will have. For classification 
we can calculate the weight using the current EP approx- 
imation. When z n = y n ( /„ ) \„/ ^/l + v\ n is above rj 4, 
the cavity probability equation (2) approaches one and 
a„ w y n exp(— z 2 /2)/^2tt(1 + v\ n ). This fast decay in- 
dicates that GPC in many cases will be effectively sparse 
even though a strictly does not contain zeros. 

In the inclusion/deletion steps we rank data points ac- 
cording to their weights. For classification we can indeed 
use the predictive probability directly, since it is a mono- 
tonic function of the weight. Including a new data point 
will of course affect the value of all other weights as well 
leading to a rearrangement of their rank. Including mul- 
tiple data points will also invalidate the predicted value 
of the weights (e.g. think of the extreme of two new data 
points being identical). We therefore have to recalculate 
the weights by retraining with EP for classification or sim- 
ply updating the posterior for regression before going to 
the next step. If we have already an active set covering 
the decision regions well enough, this rearrangement step 
will amount to minor adjustments and the approximation 
will work well. 

In this work we have only used the representer theorem 
for active set selection. It is also possible, but not tested 
here, to use all training points for prediction while only 
calculating the posterior on the active set. The inactive 
set weights are then simply set to the predicted values 
from the active set posterior. To get the full predictive 
probability one also has to calculate the contribution to 
the predictive variances which can be obtained by a similar 
theorem but for the predictive variance, see [14]. 



5. Marginal Likelihood Approximations 

In this section we decompose the marginal likelihood in 
their active and inactive set contributions. We will argue 
that the contribution from the active set will dominate, 
justifying why we can limit ourselves to optimizing the 
hyperparameters over this set. In the following section we 
will investigate this assumption empirically. The marginal 
likelihood can be decomposed via the chain rule as 



h=0 



p(yl x ) = p(y/|yA,XA,X/)p(yA|XA) 



(4) 



4 



where we have used the marginalization property of GPs, 

p(jM|X) = J p(y A \f A )p(f A \X A )df A = p(y A \X A ) , 

that we approximate as q(y A \X. A ) = Z^p.a and we iden- 
tify it as the marginal likelihood for the active set A. The 
conditional marginal likelihood term can be written as 

p(yj|yA,XA,X/) = 

J p(y I \f I )p(f I \^ I> ^ A ,f A ) P (f A \X A ,y A )df A df I , (5) 

where we used p(f|X) = p(f/|X/, X A , f A )p(i A \X A} y A ). 
We can make an EP approximation here just like in equa- 
tion (1) by replacing the posterior p(f A \~K A ,y A ) by the 
multivariate Gaussian q(f A \X A ,y A ) = Af(f A \m A ,C AA ) 
where active set specific means and variances are found 
by EP. Marginalizing over f A in equation (5) makes it now 
tractable 

q(yi\y A ,X A ,Xi) « J p(y / |f J )^(f 7 |m / | A , C II]A )dfi , 

with parameters 

m^A = K IA (K AA + C AA y 1 m A , 
Ch\a = K/7 - Kia(Kaa + C AA )^ K AI , 

where the tilted moments defined in Section 2. 

When the inactive set consists of a single example, we 
obtain the EP predictive distribution in equation (3), oth- 
erwise we have to solve for a new marginal likelihood. De- 
noting the marginal likelihood for a set {X, y} with a non- 
zero mean GP prior by 

Z(0,X,y,m) = J p(y\f)Af(i\m,K)df , 

and its EP approximation by Z-ep(9, X, y, m), we can 
write the approximation to the marginal likelihood in 
equation (4) as 

Zacc = Z E p(e,X,y I ,m I i A )Z E p(e,X,y A ,0) ■ 

Using this approximate decomposition reduces the com- 
plexity of EP from 0(N 3 N pass ) to 0((\I\ 3 + M 3 )N pass ), 
where |/| is the size of the inactive set. Unfortunately 
this is still too costly for large N. A final low complexity 
approximation to the marginal likelihood, that we denote 
by Zapp, is to replace p(yi\y A ,X) with the product of 
marginals p{yi\y A , Xa, Xj). Empirically — see Figure 
3, this approximation turns out to be lower than the actual 
marginal likelihood, i.e. the joint distribution enforces the 
labels relative to the product of the marginals. 

6. Experiments 

The results presented in this section consist of sev- 
eral classification tasks performed on three well known 



datascts, namely USPS, MNIST and IJCNN. The first 
two correspond to handwritten digit databases while the 
third is a physical system inspired dataset assembled for 
the IJCNN 2001 neural network competition. We com- 
pare the two approaches introduced in section 3 against 
the IVM and Reduced complexity SVM (RSVM) [3]. We 
consider as performance measures not only classification 
errors, but the error-cost trade-off and prediction uncer- 
tainty. We also present results for the approximation to 
the marginal likelihood of the full GP presented in section 
5. All experiments were performed on a 2.0GHz desktop 
machine with 2GB RAM. 

6.1. USPS 

The USPS digits database contains 9289 grayscale im- 
ages of size 16 x 16 pixels, scaled and translated to fall 
within the range from —1 to 1. Here we adopt the tra- 
ditional data splitting, i.e. 7291 observations for training 
and the remaining 2007 for testing. For each binary one- 
against-rest classifier we use the same model setup consist- 
ing of a squared exponential covariance matrix plus addi- 
tive jitter 

fc( Xl , x ,) = e 1 cx P (- l|xi 2g ^' 112 ) + h&u (6) 

where Sij = 1 if i = j and zero otherwise. We have three 
hypcrparametcrs in 0, namely, signal variance, character- 
istic length scale and jitter coefficient. Provided that the 
four active set methods being considered may depend upon 
random initialization we repeated all tasks 10 times. Indi- 
vidual settings for each method are: 

• PASS-GP: N init = 300, A^ sub = 10, N p , ss = 2, p inc = 0.6 
and pdei =0.99. 

• fPASS-GP: A init = 300, A sub = 10, A^ pass = 4, Pexc = 
0.02. We allow fPASS-GP to perform more passes 
through the data because fPASS-GP progresses slower 
due to Pexc being small. 

• RVM: M = 500, 9 = [1 1/16 0], C = 10 and « = 10. 
More precisely, 6 and the rcgularization parameter, C, 
were obtained by grid search cross-validation, while k 
was set to the value suggested by the authors of [3]. 

• IVM: M = 300 and Ap ass = 8. In the publicly avail- 
able version of IVM, hypcrparameter selection is done 
by alternating between full active set selection and hy- 
pcrparameter optimization. Since IVM starts from an 
empty active set, it can be very sensitive to the initial 
values of 9. We experienced however that by adding a 
linear term, 04X.Jx.j, in the covariance matrix in equa- 
tion (6) makes IVM quite insensitive to initialization. 
The results reported here include such a linear term be- 
cause we found that using equation (6) alone makes the 
IVM to perform very poorly. 

Figure 1(a) shows mean test errors for every one-against- 
rest task using PASS-GP, fPASS-GP, RSVM, IVM and 
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Figure 1: Error rates and active set sizes for USPS data, (a) mean 
classification errors for each digit using PASS-GP, fPASS-GP, RSVM, 
IVM and the full GPC with hyperparamctcr optimization, (b) Active 
set sizes for PASS-GP. Note that fPASS-GP and IVM use M = 300, 
whereas RSVM uses M = 500 for the results in (a). Error bars are 
standard deviations over 10 repetitions. 



the full GPC with hyperparameter optimization. Besides, 
Figure 1(b) shows the active set sizes for each digit using 
PASS-GP. From the figure, it can be seen that Gaussian 
process based active set methods perform similarly still 
slightly better than the RVM. The full GPC was only 
ran once due to its computational requirements, which 
explains the lack of error bars in Figure 1(a). Further- 
more, compared to fPASS-GP, IVM (M = 300) and RSVM 
(M = 500), PASS-GP seems to require smaller active sets 
to achieve similar classification performance. It is impor- 
tant to mention that we also tried larger values of M for 
the fixed active set algorithms but without any significant 
improvement in performance. 

Figure 2 show classification errors for digits 2 and 4 
against the others in top and bottom panels, respec- 
tively, as a function both of the active set size and run- 
ning time. For fPASS-GP, RSVM and IVM we used 
M = {200, ...,600} and for PASS-GP we used p inc = 
{0.2.0.3, . . . , 0.9}. We included also the classification er- 
ror obtained by the full GPC with hyperparameter opti- 
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Figure 2: Results for selected individual digits of USPS data, (a) 
and (c) show mean classification errors as a function of the active 
set size for digits 2 and 4 vs the rest, respectively, (b) and (c) 
show mean classification errors as a function of the running time, 
matching panels (a) and (c). The horizontal dashed line in all plots 
is the performance of the full GPC with hyperparameter selection. 
In both cases, the full GPC took approximately 8.6e5 seconds [7]. 
Values represent averages over ten independent repetitions with error 
bars omitted for clarity. 



mization depicted as an horizontal dashed line. See [7] for 
a more detailed comparison between PASS-GP and full 
GPCs. Several features from Figure 2 worth to be high- 
lighted, (i) Gaussian process based methods approach the 
full GP for large values of M, as expected, (ii) Similar to 
Figure 1(a), PASS-GP seems to consistently outperform 
fPASS-GP for similar sizes of M. (iii) For small values 
of M, RSVM and IVM perform better than our active 
set methods, however further increasing M does not con- 
siderably improves their performance. When M is small 
enough, it is very likely that our approaches are not able to 
obtain plausible estimates of the hyperparameters of the 
covariance function, thus its poor performance compared 
to RSVM that uses fixed values. Provided that the full 
GPC takes 8.6e5 seconds to run, PASS-GP and fPASS-GP 
are approximately three orders of magnitude faster than 
the full GPC with hyperparameter optimization, sec [7]. 
Form Figures 2(b) and 2(d), we see that for similar active 
set sizes, PASS-GP and fPASS-GP have comparable com- 
putational costs as one may expect. Similarly, RSVM and 
IVM scale better than our active set selection methods. In 
terms of error-cost trade-off, RVM has a clear edge while 
the Gaussian process based methods can be regarded as 
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Figure 3: Marginal log-likelihood approximations as a function of the active set size for digit 3 vs the rest. The plots show means and 
standard deviations (error bars) over ten repetitions. Each marker indicates a different inclusion threshold pi nc = {0.5,0.6, . . . ,0.9,0.99}. In 
the left panel, Zgp is for the full GPC (pi nc = 1), ^ep,A f° r the active set only and the remaining two, ZpxiC an d Zapp> are the proposed 
approximations. The middle and right panels shows computation times required to compute {Zapp, -ZTep.a} and ZacCi respectively. 



comparable. It is important to note that for RVM, the 
difference in computational costs as seen in Figures 2(b) 
and 2(d) should not be considered as significant since we 
are not counting the time used to obtain the parameters 
used by the RSVM, that unfortunately need to be selected 
by expensive grid search with cross-validation. The IVM 
turned out to be time-wise comparable to our active set 
methods not because its selection procedure but due to 
the hyperparameter optimization scheme used. 

The results obtained on USPS suggest that (f)PASS- 
GP is performing slightly better than the full GPC. This 
could be due to numerical instability produced by the size 
of the problem, by the iterative nature of the EP algo- 
rithm and/or not enough iterations for the hyperparame- 
ter selection procedure. However, it could also mean that 
optimizing on the active set achieves a better "local" fit 
around the decision boundary region. A priori, one cannot 
expect that a single set of hyperparameters is able to de- 
scribe all regions in input space, thus every possible active 
set. The same kind of local improvement observed here 
was also reported by [15] and [4] using GPC auxiliary set 
methods. 

Combining the ten binary tasks into a one-against- 
rest multi-class classifier, PASS-CP obtained 4.51 ±0.17% 
which is comparable or better 1 than 4.61 ±0.11% by 
fPASS-GP, 4.88 ± 0.12% by RSVM and 4.38 ± 0.11% by 
IVM. Baselines are, 5.13% by GPC with hyperparameter 
optimization, 4.78% by GPC with fixed 6 and 9.75±0.40% 
by GPC with random active set selection. Other relevant 
results found in the literature include 5.15% by online GP 
[16] and 4.98% by IVM with randomized greedy selection 
[9]. All three Gaussian process based methods are com- 
parable with state-of-the-art techniques such as SVM, see 
[17]. It is worth pointing out that the best result we could 
obtain from IVM using the covariance matrix in equation 



1 Assuming independent errors, the standard deviation on the per- 
formance is \J e(l — e)/A/tcst giving approximately 0.4% for USPS 
and 0.1% for MNIST. 



(6) was 6.27 ± 0.21% for M = 1500 which is substantially 
worse than the performance of the full GPC. As reference, 
it has been shown that the human error rate is approxi- 
mately 2.5%. 

Next we want to evaluate the two approximations to 
the marginal likelihood proposed in Section 5. We pro- 
ceed by computing the accurate but expensive approx- 
imation Zacc, the less accurate but affordable ZJapp 
and the marginal likelihood of the full GPC and the 
active set, simply denoted as Zep and Zep,a, respec- 
tively. In order to show how the approximations de- 
pend on the size of the active set, we compute them for 
p inc = {0.5, 0.6, . . . , 0.9, 0.99, 1, }, with p inc = 1 being the 
full GPC. Figure 3 shows that the three approximations 
approach the marginal likelihood of the full GPC as the 
inclusion threshold and so the active set size increases. 
As expected, ZJacc is the best approximation, however 
the computational effort needed to compute it is roughly 
two orders of magnitude larger compared to the cost of 
computing Zapp and Zep,a- It is very interesting that 
even with large values of p- inc = 0.99 the size of the active 
set remains below 10% of the training data and the con- 
tribution to the log-marginal likelihood from the inactive 
Zep{9, X, yj, io.j\a) sc t basically vanishes, since Zapp and 
Zep.a are essentially the same. 

Finally, we want to asses the uncertainty of the pre- 
dictions made by the Gaussian process based methods by 
means of comparing the predictive probabilities with the 
true outcomes. Figure 4 shows estimated log predictive 
densities for PASS-GP, fPASS-GP, IVM and the full GPC, 
using all USPS predictions made on the test separated into 
correct and incorrect predictions. Assuming no labeling 
errors, the true density consists of two point mass densi- 
ties at {0, 1} provided our one-against-rest setting. As one 
might expect, the full GPC achieves the best approxima- 
tion, followed by fPASS-GP and PASS-GP. IVM suggests 
more predictive uncertainty because of the two "spurious" 
modes in Figure 4(a). Another way to asses the predic- 
tive uncertainty is to compute Brier scores, that measures 
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the average of square deviations between estimated and 
true predictive probabilities. For the USPS dataset we ob- 
tained: 0.53±0.03, 0.27±0.01, 0.71±0.02 and 0.14±0.00 
for PASS-GP, fPASS-GP, IVM and full GPC, respectively. 
Note that the Brier scores are in agreement to what we 
observe in Figure 4. 
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Figure 4: Predictive density estimation for USPS data. Densities for 
correct and incorrect predictions arc shown separately in (a) and (b), 
respectively. The ground truth for (a) is a two point mass mixture 
at {0, 1} and a flat distribution for (b). 



6.2. MNIST 

The MNIST digits database has 60000 and 10000 as 
training and testing examples respectively. Each example 
is a gray-scale image of 28 x 28 pixels. The estimated 
human test error is around 0.2%. The settings used for 
the algorithm are nearly the same as those for USPS with 
only two differences. iV su b is set 100 since the training set 
in MNIST is almost ten times larger than USPS and we 
are not updating the hyperparameter in each iteration but 
every 10-th, in order to make the training process faster. 
We also ran our algorithm with hyperparameter updates 
every single iteration without any noticeable improvement 
in performance (results not shown). Figure 5 shows test 
error rates, active set sizes, multi-class errors and running 
times for each binary classifier based on PASS-GP, fPASS- 
GP and RSVM using a 9-th degree polynomial covariance 
function 
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We use this covariance matrix instead for the standard 
squared exponential from equation (6), because a polyno- 
mial covariance is well known for providing optimal results 
for the MNIST dataset [18]. Results for the squared expo- 
nential covariance function can be found in [7] and confirm 
that the polynomial covariance behave slightly better for 
this dataset. For IVM we could not make the polynomial 
covariance to work properly, thus we decided to use the 
equation (6) plus a linear term like in the USPS experi- 
ment. 

From Figure 5(b) it can be seen that in every case the 
size of the active set is less than 4% of the training set. 
The results for fPASS-GP and RSVM were obtained us- 
ing M = 2000. We did try for larger values of M but the 
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Figure 5: Error rates, active set sizes and run times for MNIST data, 
(a) Mean classification errors for each digit task using PASS-GP, 
fPASS-GP, RSVM and IVM. (b) Active set sizes for PASS-GP. Note 
that fPASS-GP, RSVM and IVM use M = 2000. (c) Mean multi- 
class classification errors and (d) average timings over one-against- 
the-rest classifiers and repetitions. Error bars in (a), (c) and (d) are 
standard deviations computed over 10 repetitions of the experiment. 



reduction in error was not significant compared to the over- 
head in computational cost. Figure 5(a) shows the classi- 
fication error for each digit. The performance of the three 
approaches considered is comparable but letting PASS-GP 
with an edge over the other two, both in terms of error and 
variances. Figure 5(c) shows the results of combining the 
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Tabic 1: Results for USPS and MNIST using PASS-GP and active set invariances. Figures are averages over 10 and 5 repetitions, respectively. 



ten binary classifiers. Again, PASS-GP behaves slightly- 
better than the others, however when looking at the run 
times in Figure 5(d) we can see that RSVM is computa- 
tionally more affordable than our approaches, even more 
considering that it uses M = 2000. Comparing PASS- 
GP to fPASS-GP, the former has a smaller mean run time 
but with larger variance compared to the more expensive 
fPASS-GP. fPASS-GP is more stable time-wise, but takes 
more time because it uses a fixed M ~ 2000. 

As far as the authors know these are the first GP based 
results on MNIST using the whole database. IVM [8] with 
sub-sampled images of size 13 x 13 has been tried to pro- 
duce a test error rate of 1.54 ± 0.04%. Seeger [9] made 
additional tests on some digits (5, 8 and 9) on the full size 
images without any further improvement. On the other 
hand, PASS-GP is again comparable with state-of-the- 
art techniques not including preprocessing stages and/or 
data augmentation, for instance SVM is 1.4% and 1.22% 
using RBF and a 9-th degree polynomial kernel, respec- 
tively. The reported sizes of support vector sets are ap- 
proximately two times larger than our active sets [18]. 

6.3. Incorporating Invariances 

It has been shown that a good way to improve the over- 
all performance of a classifier is to incorporate additional 
prior knowledge in the training procedure particularly by 
means of externally handling invariances of the data. In 
[18], it is shown that instead of just dealing with the invari- 
ances by augmenting the original dataset — which turns 
out to be infeasible in many cases, it is better to augment 
only the support vector set of a SVM. We therefore try 
the same procedure as suggested in [18] consisting of four 
1-pixcl translations (left, up, right and down directions) 
on each element of the active set for USPS and eight 1- 
pixel translations (including diagonals as well) for MNIST, 
resulting in new training sets of size 5 x M and 9 x M, 
accordingly. In this case we have used the same settings 
as in the previous experiments with only two differences. 
First, the hyperparameters have been set to those found 
using the original dataset. Second, we made the important 
observation that in order to get a performance improve- 
ment a large active set was needed. For training on the 
augmented dataset we increased p; nc from 0.6 to 0.99 for 
USPS and 0.9 for MNIST. We conjecture that we can get 
even better performance — at the expense of a substantial 
increase in complexity, by increasing pi nc in the initial run 
to get a larger initial active set to work with. 



Results in Table 1 show that performance-wise, PASS- 
GP reached 3.35 ± 0.03% for USPS and 0.86 ± 0.02% for 
MINST on the multi-class task, what is comparable to 
state-of-the-art techniques. For instance SVM obtained 
3.2% on USPS and 0.68% on MNIST with an equivalent 
procedure. The difference in performance is probably due 
to our active set not being large enough, since support set 
sizes reported for SVMs are typically twice as large [18]. 

6.4. I J CNN 

As final experiment, we want to compare fPASS-GP, 
RSVM and IVM on a common ground. For this purpose 
we use the IJCNN dataset which is widely used by the 
SVM research community. It consists of 49990 training ex- 
amples, 91701 test examples and each observation counts 
with 22 features. We consider M = {100, 200, 1000} 
with squared covariance function and fixed hyperparame- 
ters, the latter using the values suggested in [3], that is 6 = 
[1 1/8 1/16] for f-PASSGP and IVM, and 9 = [1 1/8 0], 
C = 16 for RSVM. For IVM we include a linear term as 
in the previous experiments with 64 = 1. Besides, each 
setting was repeated 10 times to collect statistics. Figure 
6 summarizes the results obtained. More specifically, Fig- 
ure 6(a) shows the mean classification error as a function 
of the active set. We can see that fPASS-GP is slightly 
better than RSVM and IVM in the entire range of M, be- 
sides the former seems to be particularly good for small 
values of M . When we plot mean errors as a function of 
running times — as a proxy for the computational cost, we 
see that there exist two regimes, one for small values of M 
where fPASS-GP outperforms RSVM and IVM, and the 
other where the cubic complexity of the GPCs start hurt- 
ing fPASS-GP, thus letting RVM and IVM with a better 
error-cost trade-off. 

7. Discussion 

We have proposed a framework for active set selection 
in GPC. The core of our active set update rule is that 
the predictive distribution of a GPC can be used to quan- 
tify the relative weight of points in the active set that can 
be marked for deletion or new points from the active set 
with low predictive probabilities, that make them ideal 
for inclusion. The algorithmic skeleton of our framework 
consists on two alternating steps, namely active set up- 
dates and hyperparamctcr optimization. We designed two 
active set update criteria that target two different prac- 
tical scenarios. The first we called PASS-GP focuses on 
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Figure 6: Error rates and run times for IJCNN data, (a) Mean 
classification error as a function of the active set size using fPASS-GP, 
RSVM and IVM. (b) Mean classification error as a function of the 
run time. Error bars correspond to standard deviations computed 
over 10 repetitions of the experiment. 



intcrpretability of the parameters of the update rule by 
thresholding the predictive distributions of GPC. The sec- 
ond acknowledges that in some applications having a fixed 
computational cost is key, thus fPASS-GP keeps the size 
of the active set fixed so the overall cost and memory re- 
quirements can be known beforehand. 

We presented theoretical and practical support that our 
active set selection strategy is efficient while still retain- 
ing the most appealing benefits of GPC: prediction un- 
certainty, model selection, prior knowledge leverage and 
state-of-the-art performance. Compared to other approxi- 
mative methods, although slower than IVM [8] and RSVM 
[3], PASS-GP provides better results. We did not consider 
any auxiliary set method like FITC [4] because for task 
of the size like for example MNIST or IJCNN, it is pro- 
hibitive. Additionally, we have noticed in practice that 
our approximation is quite insensitive to the initial active 
set selection and also that more than two or three passes 
through the data do not yield improved performance nor 
large active set sizes. The code used in this work is based 
on the Matlab toolbox provided with [2] and is publicly 
available at http : / / cogsys . imm . dtu . dk/passgp. 

The not so satisfying feature of active set approxima- 
tions, is that we are ignoring some of the training data. 
Although some of our findings on the USPS data set ac- 



tually suggest that this can be beneficial for performance, 
it is of interest to make a modified version where the in- 
active set is used approximately in a cost efficient way. 
The representer theorem for the mean prediction and the 
approximations for marginal likelihood discussed in this 
paper might give inspiration for such methods. In con- 
clusion, efficient methods for GPs are still much in need 
when the data is abundant such as in ordinal regression 
for collaborative filtering. 
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